Abstract

The isochore concept in the human genome sequence was challenged in an analysis by the
International Human Genome Sequencing Consortium (IHGSC). We argue here that a
statement in the IHGSC analysis concerning the existence of isochores is incorrect, because
an inappropriate statistical test was applied. Testing for the existence of isochores should be
equivalent to testing the homogeneity of windowed GC%. The statistical test applied in the
IHGSC analysis, the binomial test, is however a test of whether a sequence is random at the
base level. For testing the existence of isochores, or homogeneity in GC%, we propose
another statistical test: the analysis of variance (ANOVA). It can be shown that DNA
sequences that are rejected by the binomial test may not be rejected by the ANOVA test.

Background 

The degree of homogeneity in base composition in the human genome is a fundamental
property of the genome sequence. Not only does it characterize the organization and
evolution of the genome, but it also provides a context for many practical sequence analyses.
Statistical quantities such as GC%, used for sequence analyses such as computational gene
recognition, should be sampled from a homogeneous region of the sequence. If these
quantities are sampled from an inhomogeneous region, error is introduced and the quality
of a sequence analysis, such as the performance of gene prediction, could be affected.

It has been known for a long time from the work of Bernardi's group that there are
compositionally homogeneous regions in the human genome with sizes of at least 200-300 kb
[1]. These homogeneous regions are called "isochores", and the whole genome is a
mosaic of isochores. Recently, however, this view of the human genome was questioned in an
initial analysis of the human genome draft sequence by the IHGSC. The analysis purportedly
shows that no sequence of 300-kb length examined could be claimed to be homogeneous ("... the
hypothesis of homogeneity could be rejected for each 300-kb window in the draft genome
sequence", page 877 of the IHGSC analysis), and a stunning statement was made that,
essentially, the isochore concept does not hold ("... isochores do not appear to merit the
prefix 'iso'", page 877 of the IHGSC analysis).

The purpose of this Letter is to show that an incorrect statistical distribution for
windowed GC% is assumed in the IHGSC analysis, based on an unrealistic condition for DNA
sequences. As a result, the statistical test used in that analysis is invalid. We present a
more appropriate statistical test, assuming a more reasonable statistical distribution of
windowed GC%. Under the new test, the conclusion concerning the existence of isochores is
drastically altered. Although our test result may still depend on the window size at which
GC% is sampled, and may possibly depend on the choice of GC% groups, it is clear that the
test in the IHGSC analysis is too biased towards rejecting the null hypothesis of
homogeneity, and sequences that fail that test usually do not fail our new test.

Results 

For a sequence to be homogeneous in GC%, the mean of windowed GC% values sampled from
one region of the sequence should be similar to that in another region, given the amount
of variance allowed. In other words, to claim that a sequence is homogeneous, we need to
calculate not only the means of GC% along the sequence but also the variance. Generally
speaking, the mean and the variance are two independent parameters of a statistical
distribution. However, in the homogeneity test of the IHGSC analysis, the variance is
assumed to be a function of the mean, and thus it is not independently estimated.

In the IHGSC analysis, the windowed GC% is assumed to follow a binomial distribution. For
a binomial distribution to hold, bases within the window should be uncorrelated, similar to
tossing a coin many times. Violating this assumption invalidates the use of the binomial
distribution. A more reasonable statistical distribution for GC% is the normal distribution
which, unlike the binomial distribution, has two independent parameters (mean and variance).
The mean can be estimated from a window, whereas the variance can be estimated from a
group of windows.
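
The contrast between the two models can be written out explicitly. The block below is a brief restatement under the assumption that a window contains $N$ bases, each being G or C with probability $m$; the symbol $N$ is introduced here for illustration only (the windows analyzed in this Letter have $N = 20000$).

```latex
\[
\begin{aligned}
\text{binomial (uncorrelated bases):}\quad
  & E[\mathrm{GC}] = m, \qquad \mathrm{Var}[\mathrm{GC}] = \frac{m(1-m)}{N} \\
\text{normal:}\quad
  & \mathrm{GC} \sim \mathcal{N}(\mu, \sigma^2), \qquad
    \sigma^2 \text{ estimated from the data, independently of } \mu
\end{aligned}
\]
```

Under the binomial model the variance is fixed once the mean is known, whereas the normal model leaves the variance as a free parameter.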

To illustrate our point, we analyze two well-known isochore sequences, the Major
Histocompatibility Complex (MHC) class III and class II sequences on human chromosome 6,
with lengths 642.1 kb and 900.9 kb, respectively. The exact borders of the two isochore
sequences are determined by a segmentation procedure [10] and an online resource on
isochore mapping [11]. We repeat the test of the IHGSC analysis, whose null hypothesis is
that these two sequences, when viewed as a collection of many 20-kb windows, are sampled
from a binomial distribution. According to the IHGSC analysis, a rejection of this test is
considered evidence for heterogeneity. The test results are included in Table 1, which
clearly shows that the variances of GC% values sampled from 20-kb windows are much larger
than expected from a binomial distribution, with p-values close to 0.

This result, that the variance of GC% sampled from windows is much larger than
expected from a binomial distribution, has been known for a long time [13] (and
the references therein). It is therefore not surprising that the binomial distribution
assumption is rejected even for isochore sequences, as shown in Table 1. Nevertheless, this
rejection only shows that a 20-kb window is not a series of 20000 uncorrelated bases; it is
not a rejection of homogeneity of windowed GC% along the sequence.
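
As a quick check of the arithmetic behind the MHC class III row of Table 1, the following minimal sketch recomputes the expected binomial variance, the variance ratio and the test statistic from the reported mean, variance and number of windows. The variable names are illustrative, and small differences from the tabulated values are expected because the inputs are rounded.

```python
# Values for the MHC class III sequence, as reported in Table 1.
n_windows = 32            # number of 20-kb windows
mean_gc = 0.5188          # mean GC fraction over the windows
observed_var = 0.0005345  # observed variance of windowed GC
window_len = 20000        # bases per 20-kb window

# Variance expected if the bases within a window were uncorrelated.
expected_var = mean_gc * (1.0 - mean_gc) / window_len

# Ratio of observed to expected variance, and the chi-square statistic.
ratio = observed_var / expected_var
chi2_stat = (n_windows - 1) * ratio

print(f"expected binomial variance: {expected_var:.4e}")
print(f"variance ratio:             {ratio:.2f}")
print(f"chi-square statistic (df={n_windows - 1}): {chi2_stat:.1f}")
```

The observed variance is roughly 43 times the binomial expectation, which is what drives the rejection.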

To confirm that the binomial test used in the IHGSC analysis is a test of randomness of
the sequence rather than of homogeneity, we test one bacterial sequence (Borrelia
burgdorferi, 910.7 kb) and two randomly generated sequences (with the same length and base
composition as the MHC class III and class II sequences). Table 1 shows that the null
hypothesis cannot be rejected by the binomial test for the two random sequences, but it
is rejected for Borrelia burgdorferi, a particularly homogeneous genome according to a
recent survey of archaeal and bacterial genome heterogeneity.
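
The text does not state how the random control sequences were generated. One simple possibility, sketched here purely as an assumption, is to shuffle the bases of the original sequence, which preserves length and base composition exactly while destroying correlations between neighbouring bases. The function name `shuffled_control` and the toy sequence are hypothetical.

```python
import random


def shuffled_control(seq, seed=0):
    """Return a random permutation of seq (same length, same base counts)."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    bases = list(seq)
    rng.shuffle(bases)         # in-place shuffle of the base list
    return "".join(bases)


original = "GGGCCCATATATGCGCATTA"  # toy stand-in for a real sequence
control = shuffled_control(original)
print(control)
```

An alternative would be to draw each base independently with the observed base frequencies, in which case the composition matches only on average.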

We suggest that a more reasonable statistical distribution for windowed GC% is the
normal (Gaussian) distribution, and that a more appropriate test of homogeneity of these
GC% values along a sequence is the analysis of variance (ANOVA). There are at least two
reasons to believe that ANOVA is the more appropriate test. First, it is a test of
equality between means, which matches the intuitive meaning of homogeneity, i.e., GC% is
the same along the sequence. Second, ANOVA and the normal distribution reflect the real
situation that DNA sequences are not random sequences and that windowed GC% values exhibit
higher variances. ANOVA allows the variance to be estimated from the data, rather than
being fixed by the mean value as in the binomial distribution. ANOVA was previously applied
to the study of inter-chromosomal homogeneity of the yeast genome.



To apply ANOVA to test homogeneity, we split a sequence into several super-windows,
with several windows per super-window. The GC% of each window is calculated. The null
hypothesis is that the mean of windowed GC% values is the same in every super-window. The
simplest choice of super-windows and windows is to let all windows have the same length.
To match the discussion in the IHGSC analysis, we choose 20-kb windows and 300-kb
super-windows. This corresponds to roughly 2 super-windows with 16 windows per super-window
for the MHC class III sequence, and 3 super-windows with 15 windows per super-window for the
MHC class II sequence. ANOVA test results for these two isochores are listed in Table 2.
The p-values are 0.192 and 0.323 for the MHC class III and class II sequences, respectively.
The null hypothesis, that the means of GC% in different super-windows are the same, is not
rejected.
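
The window and super-window counts quoted above can be reproduced with simple integer arithmetic. The sketch below shows one way to arrive at them, assuming that incomplete trailing windows are discarded and that the windows are split evenly among super-windows; the function name `partition_counts` is hypothetical and the rule itself is an assumption, not a procedure stated in the text.

```python
def partition_counts(length_kb, window_kb=20.0, super_kb=300.0):
    """Return (number of super-windows, windows per super-window)."""
    n_windows = int(length_kb // window_kb)           # complete windows only
    n_super = max(1, int(length_kb // super_kb))      # complete super-windows
    per_super = n_windows // n_super                  # even split of windows
    return n_super, per_super


for name, length_kb in [("MHC class III", 642.1),
                        ("MHC class II", 900.9),
                        ("B. burgdorferi", 910.7)]:
    n_super, per_super = partition_counts(length_kb)
    print(f"{name}: {n_super} super-windows, {per_super} windows each")
```

With this rule the effective super-window length is slightly above 300 kb for the MHC class III sequence (16 windows of 20 kb), which matches the "roughly" in the description.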

When the ANOVA test is applied to the Borrelia burgdorferi genome sequence and the
two randomly generated sequences, the null hypothesis cannot be rejected, indicating that
all three sequences are homogeneous at the respective window and super-window sizes (20 kb
and 300 kb). This is a more satisfactory situation than with the binomial test, because now
a homogeneous bacterial sequence is indeed confirmed to be homogeneous by the test.

Discussions 



Due to the "domains within domains" phenomenon in DNA sequences [18, 19, 20],
we should not assume automatically that a homogeneity test result obtained with 20-kb
windows and 300-kb super-windows will hold true for other window and super-window sizes.
To check this, we carry out ANOVA tests on the MHC class III and class II sequences
at other window and super-window sizes. Fig. 1 shows the ANOVA test result
($-\log_{10}$(p-value)) for window sizes of around 20 kb, 10 kb, 5 kb, and 2.5 kb, with
the sequence partitioned into 2, 3, 5, 8 (2, 3, 5, 9) super-windows for the MHC class III
(II) sequence.
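
A scan over window sizes needs two ingredients: the GC fraction of each window, and a grouping of consecutive windows into super-windows. The following is a minimal sketch on a toy sequence, assuming non-overlapping windows, discarding any incomplete trailing window, and ignoring ambiguous bases; the function names are hypothetical and do not come from the original analysis.

```python
def windowed_gc(seq, window_len):
    """GC fraction of each complete, non-overlapping window of seq."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window_len + 1, window_len):
        window = seq[start:start + window_len]
        values.append((window.count("G") + window.count("C")) / window_len)
    return values


def group_into_super_windows(values, n_super):
    """Split consecutive window values into n_super equally sized groups."""
    per_super = len(values) // n_super  # leftover windows at the end are dropped
    return [values[i * per_super:(i + 1) * per_super] for i in range(n_super)]


toy = "GGCC" * 4 + "ATAT" * 4  # 32-base toy sequence: GC-rich half, AT-rich half
for window_len in (8, 4):
    groups = group_into_super_windows(windowed_gc(toy, window_len), n_super=2)
    print(window_len, groups)
```

Each list of groups produced this way can then be passed to an ANOVA routine, once per combination of window size and number of super-windows.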

Several observations can be made from Fig. 1. First, when GC% values are sampled from
(e.g.) 20-kb windows, changing the number of super-windows (i.e. the number of partitions of
the sequence) does not greatly influence the ANOVA test result. This change corresponds
to a regrouping of windowed GC% values. Generally speaking, if the sequence is homogeneous,
with all GC% values (taken at a fixed window size) being similar, regrouping
these values does not turn an insignificant result into a significant one.
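
The first observation can be related to the standard one-way ANOVA decomposition, recalled here as a reminder rather than as part of the original analysis. With $x_{ij}$ denoting the GC% of window $j$ in super-window $i$, $\bar{x}_i$ the mean within super-window $i$, and $\bar{x}$ the overall mean (notation introduced here for illustration), the total sum of squares depends only on the window values, so a regrouping merely redistributes a fixed total between the among-group and within-group terms:

```latex
\[
\underbrace{\sum_{i=1}^{a}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2}_{SS_{\mathrm{total}}}
= \underbrace{\sum_{i=1}^{a} n_i (\bar{x}_i - \bar{x})^2}_{SS_{\mathrm{among}}}
+ \underbrace{\sum_{i=1}^{a}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}_{SS_{\mathrm{within}}}
\]
```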

Second, the ANOVA test becomes more significant when the window size decreases.
This observation is understandable because at smaller length scales, GC% fluctuations are
no longer averaged out. These smaller-length-scale fluctuations could be due to repeats,
insertions, foreign elements, etc. For the MHC class II sequence, as the window size is
reduced to around 2.5 kb, the ANOVA test result is typically significant (Fig. 1). This
is consistent with the definition of isochores as "fairly homogeneous" (as opposed to
"strictly homogeneous") segments above a size of 3 kb, and justifies the "coarse graining"
procedure used to locate isochore boundaries.

Third, two isochore sequences may look similar at one length scale (e.g. 20 kb), but
quite different at another length scale. Fig. 1 shows that the MHC class II sequence is more
heterogeneous than the MHC class III sequence when viewed at the 2-10 kb length scales.
GC-poor sequences are generally considered to be more homogeneous than GC-rich sequences
or, more accurately, a sequence with a GC% closer to 50% is more heterogeneous than a
sequence whose GC% is far from 50% [13], [15]. Since the GC% of the MHC class III and II
sequences is 51.9% and 41.1%, respectively, we might expect the MHC class II sequence to be
more homogeneous than the class III sequence. Interestingly, Fig. 1 shows the contrary.

To conclude, the binomial test used in the IHGSC analysis should not be taken as a test
of homogeneity if the expected variance does not reflect the true variance in the sequence.
The expected variance in a binomial test (which is derived from the mean GC% instead of
being an independent parameter) is unrealistic because the underlying base sequence is
not random/uncorrelated. We are naturally led to the ANOVA test if we actually estimate
the variance from the data. With ANOVA tests, it is clear that homogeneous regions of
GC% in the human genome do exist; in other words, isochores exist.



Methods 

Binomial test: Following the IHGSC analysis, a binomial test is applied to many GC% values
measured from a fixed-size window (e.g. 20 kb). For example, if the sequence length is
900 kb, there are $n = 45$ such 20-kb windows and 45 GC% values. The variance of these
GC% values ($\sigma^2$) is calculated, and the variance expected from a binomial
distribution is $\sigma_0^2 = m(1-m)/20000$, where $m$ is the probability of G or C. The
value of $m$ can be estimated by the actual GC% of the sequence. The test statistic is
$\chi^2 = (n-1)\sigma^2/\sigma_0^2$. Under the null hypothesis (that windowed GC%
measurements follow a binomial distribution, which is true when the underlying base
sequence is random/uncorrelated within the window), $\chi^2$ follows the $\chi^2_{df=n-1}$
distribution (e.g. $\chi^2_{df=44}$ in this example). For any given $\chi^2$ value, the
p-value can be determined from the corresponding distribution.
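
The statistic described above can be sketched in a few lines using only the Python standard library. The code below is an illustration, not the original implementation: the function name is hypothetical, the mean of the window values stands in for the GC% of the whole sequence, and the sample variance (denominator $n-1$) is assumed for $\sigma^2$.

```python
from statistics import mean, variance


def binomial_dispersion_statistic(gc_values, window_len=20000):
    """Return (chi2, df) comparing observed variance with the binomial one.

    gc_values  : GC fractions (between 0 and 1), one per fixed-size window
    window_len : number of bases per window
    """
    n = len(gc_values)
    if n < 2:
        raise ValueError("need at least two windows")
    m = mean(gc_values)                  # assumed proxy for the sequence GC%
    observed_var = variance(gc_values)   # sample variance, n - 1 denominator
    expected_var = m * (1.0 - m) / window_len
    chi2 = (n - 1) * observed_var / expected_var
    return chi2, n - 1


chi2, df = binomial_dispersion_statistic([0.50, 0.52, 0.48, 0.50])
print(chi2, df)
```

The p-value is the upper-tail probability of the chi-square distribution with `df` degrees of freedom; if SciPy is available, `scipy.stats.chi2.sf(chi2, df)` returns it.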

ANOVA test: The ANOVA test (analysis of variance) is applied to several groups of
GC% values (by comparison, the binomial test is applied to only one group of GC% values).
The concepts of "group" and "member" in ANOVA become "super-window" and "window"
here. The number of super-windows partitioned in a sequence is $a$, and the number of
windows in super-window $i$ is $n_i$. Two "sums of squares" (SS) are defined:
$SS_w = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (GC\%_{ij} - \overline{GC\%}_i)^2$ (within a group),
and $SS_a = \sum_{i=1}^{a} n_i (\overline{GC\%}_i - \overline{GC\%})^2$ (among groups).
The test statistic is $F = SS_a/SS_w \times \sum_{i=1}^{a} (n_i - 1)/(a - 1)$. The
distribution of $F$ under the null hypothesis (i.e., $GC\%_1 = GC\%_2 = \cdots = GC\%_a$)
is known, and this distribution can be used to determine the p-value.
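
A minimal standard-library sketch of the F statistic defined above follows. It is meant as an illustration of the formula only: the function name is hypothetical, each group is a list of windowed GC values from one super-window, and the case of zero within-group variation is not handled.

```python
def anova_f_statistic(groups):
    """Return (F, df_among, df_within) for a one-way ANOVA on the groups."""
    a = len(groups)
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    if a < 2 or min(sizes) < 1 or total <= a:
        raise ValueError("need at least two groups and within-group replicates")

    group_means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / total

    # Within-group and among-group sums of squares.
    ss_within = sum((x - gm) ** 2
                    for g, gm in zip(groups, group_means) for x in g)
    ss_among = sum(n * (gm - grand_mean) ** 2
                   for n, gm in zip(sizes, group_means))

    df_among = a - 1
    df_within = total - a   # equals the sum over groups of (n_i - 1)
    f_value = (ss_among / df_among) / (ss_within / df_within)
    return f_value, df_among, df_within


f_value, df_among, df_within = anova_f_statistic([[1.0, 2.0, 3.0],
                                                  [2.0, 3.0, 4.0]])
print(f_value, df_among, df_within)
```

The p-value is the upper-tail probability of the F distribution with (`df_among`, `df_within`) degrees of freedom; if SciPy is available, `scipy.stats.f.sf(f_value, df_among, df_within)` returns it.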

Figure 1: The $-\log_{10}$(p-value) of ANOVA tests as a function of the window size, for MHC class III (left)
and MHC class II (right) sequences. Tests with the same number of super-windows are connected
by a line. The size of the super-window and the number of super-windows in the sequence are indicated for
each line.

seq                  # win (n)   mean     var          binomial var $\sigma_0^2$   $\sigma^2/\sigma_0^2$   $\chi^2 = (n-1)\sigma^2/\sigma_0^2$   p-value
MHC class III        32          0.5188   0.0005345    0.00001248                  42.8215                 1327.47                               close to 0
MHC class II         45          0.4105   0.0007268    0.00001210                  60.0709                 2703.19                               close to 0
random (class III)   32          0.5185   0.00001137   0.00001248                  0.9110                  28.2402                               0.609
random (class II)    45          0.4106   0.00001255   0.00001210                  1.0369                  45.6244                               0.404
B. burgdorferi       45          0.2859   0.0001515    0.00001021                  14.8432                 653.099                               close to 0

Table 1: Testing the hypothesis that GC% values sampled from 20-kb windows follow a
binomial distribution. Five sequences are tested: the MHC class III and MHC class II isochore sequences,
two random sequences similar to these two MHC sequences (same length and same base composition), and
the bacterium Borrelia burgdorferi genome sequence. Detailed explanation of column headers: 1. Sequence
name. 2. Total number of windows in the sequence ($n$), each contributing a GC% value. 3. Mean of
the GC% ($m$). 4. Variance of the GC% ($\sigma^2$). 5. Variance of GC% expected from a binomial distribution
($\sigma_0^2 = m(1-m)/20000$). 6. Ratio of the two variances, $\sigma^2/\sigma_0^2$. 7. Test statistic
$\chi^2 = (n-1)\sigma^2/\sigma_0^2$. 8. p-value from the binomial distribution test.

                                               df   SS           MS           F-value   p-value
MHC class III (sw=2, w=16)
  between windows                              1    0.0009159    0.0009159    1.781     0.192
  within windows                               30   0.01543      0.0005143
MHC class II (sw=3, w=15)
  between windows                              2    0.001658     0.0008288    1.162     0.323
  within windows                               42   0.02997      0.0007137
random seq similar to class III (sw=2, w=16)
  between windows                              1    0.00000288   0.00000288   0.247     0.623
  within windows                               30   0.0003496    0.00001165
random seq similar to class II (sw=3, w=15)
  between windows                              2    0.00004546   0.00002273   1.884     0.165
  within windows                               42   0.0005066    0.00001206
B. burgdorferi (sw=3, w=15)
  between windows                              2    0.0002064    0.0001032    0.671     0.517
  within windows                               42   0.006461     0.0001538

Table 2: ANOVA test results for the five sequences (two MHC isochore sequences and their
randomized sequences, and the bacterium Borrelia burgdorferi sequence). df: degrees of freedom; SS: sum of
squares; MS: mean squares; F-value: test statistic value; p-value: p-value from the ANOVA test; sw and
w are the number of super-windows and the number of windows per super-window.



